Lzss Compression, or: How I Won Hugi Compo #2

This article describes my program for the second hugi size coding competition.

Task

The task was very simple. You had to write a COM program that displays a predefined text on the screen. The length of the text is 903 bytes, and it must be printed exactly byte per byte.

First of all, a compressor should be developed and then the small decompression routine (in real life you should do this simultaneously, because the com size depends on both code and compressed data).

Compression

I used the well-known lzss algorithm (perhaps lz77 is the correct name). The idea is to replace the repetitions with an offset/size pair, determining the position and size of the previous occurrence. The offset can refer to the position from the beginning of the file or from the current location backwards. To distinguish the normal characters from these offset/size pairs, we must use an extra bit (prefix).

First attempt:

- we must use a bit stream

- the length of the characters is 7 bits

- we use a fixed offset bit length and a fixed size bit length

- The compression algorithm is to search through the text from the beginning and find the longest repetitions in the limits of the offset/size bit lengths. Minimum size limit is 2. (This is a parameter, like offset, and size length in bits. We should try all possibilities for a specific coding.)

With (hopefully) optimal parameters, the patterns look like the following:

  7bit character                0 <7 bits>
  offset/size repetition        1 <9 bits (offset)> <3 bits (size)>

Simple decoder:

 Loop
    If GetBits(1)  = 0 Then
       Text[ Pos ] = GetBits(7)
       Pos = Pos + 1
    Else
       Offset = GetBits(9)
       Size = GetBits(3)
       For I = 1 to Size
           Text[ Pos ] = Text[ Pos - Offset ]
           Pos = Pos + 1

No end condition? We can win some bytes in the decoder if we use an end condition like DI >= 8000h (see decoder chapter), but we must append a $ to the input text for the dos print interrupt. Since this little decoder can't freeze, there is no problem when it goes beyond the normal input data.

I don't know the exact size for this compression, but in any case it is not good enough. Where can we optimize it?

We use repetitions only if the size is greater than N. (We will choose the value of N later.) The stored size should be (size-N) to be able to use the full range of x bits. If you make a histogram of the repetitions size parameters, you see that most of them are smaller ones. Another simple coding for the size parameter is to store (size-N) bits with the value 1 and a bit with the value 0 at the end. This way we only need one bit to code a N size repetition.

 Second design of the patterns (N=2)
  7bit character                0 <7 bits>
  offset/size repetition        1 <9 bits (offset)> 1*<(size-2) bits> 0

If you write your own lzss compression routine, you will probably question if this is the optimal compression for this pattern. What if you use another repetition in the beginning and lose some bits, but at the end you will get a smaller result? My first compressor went from the beginning to the end and used the longest repetitions.

Second attempt: Try all possible patterns, not only the longest repetitions, at every character. Bad method: plenty of cases, years to test them all. But I realized that compression depends only on earlier source text, not on the used patterns to compress it. For example, it doesn't influence the further compression if I use repetitions or 7bit characters instead, only the resulting bitstream would probably be longer.

New idea: Let's try to compress from the end to the beginning! We calculate the best compression for every character as it would be the first. Since we know the optimal compressed bitstream for all later positions (remember, we compress back to the beginning), we can calculate the resulting bitstream length for all possible patterns, and select the minimal one.

 size[pos] = min( pattern(N).bits + size[pos + pattern(N).codedbytes] )

For our current patterns:

  7bit character         -> bits = 1+7, codedbytes = 1
  offset/size repetition -> bits = 1+9+(size-2)+1, codedbytes = size

(We must try a repetition pattern with every possible size, not only with the longest!)

Those 9bit offsets are very big. What if we make two types of repetition patterns? A smaller and a larger one. If the pattern count is about the same for these two types, we distinguish them with one additional prefix bit best.

Third design of the patterns (N1=2, N2=3)

  7bit character                0 <7 bits>
  small offset/size rep.        1 0 <6 bits> 1*<(size-2) bits> 0
  large offset/size rep.        1 1 <9 bits> 1*<(size-3) bits> 0

We have a small character set, only few upper case letters, four punctuation marks, newline chars, space.

- the first char in the file is upper case
- all other upper case letters come after '?' and '.'
- almost all lines are 44 char long

Let's make a 5bit character set: 'a'..'y' ',' '.' '?' '-'. This is 25+4=29, we still have space for 3 chars. We need one for space, one for $ at the end and one char for special newline, when the line is smaller than 44. Before compression we have to convert the input text: lower case, filter out the newline chars, and $ at the end. Where line is not 44 char long, a special character is needed.

Final design of the patterns

  5bit character                0 <5 bits>
  small offset/size rep.        1 0 <6 bits> 1*<(size-2) bits> 0
  large offset/size rep.        1 1 <9 bits> 1*<(size-3) bits> 0

Compressed data is 395 bytes.

Decoder

We have a good compression algo, now need a small decoding routine. I will explain my final code, unfortunately I have no previous versions.

a. The first problem is the bit stream. We have a very lovely bit test instruction (386+). It can access a -32768..+32767 bit range from memory and return the state in cflag. I use BX for base pointer and BP as bit index.

b. I decided to implement the 5bit -> 7bit conversion in the first pass. The 5bit range has to parts: an alphabetical A..Y range, and the special 7 characters. The first part is simple to decode, but the second needs a lookup table. The smallest byte lookup instruction is xlat -> BX must point to the beginning of the 7byte loopkup table. Since SI=100h at startup, we can easily set BX to 100h, and hope for a non-destructive loopkup table as code at 100h. The first byte is something special because it can be in a range and we can use it for our purpose. The final lookup table:

         push    si              ;   newline      (negative byte) - 'Z'
         ror     dl,cl           ;   , $          (ascii code) - 'Z'
         rol     si,cl           ;   - space      (ascii code) - 'Z'
         mov     ah,0C5h         ;   . ?          (ascii code) - 'Z' - 32

Fortunately non-destructive, and we have a 100h on the stack.

         pop     bx              ; bx = 100h

c. Now we have to set up BP for the bit test instruction.

         mov     bp,(offset data-offset start)*8

d. We also need a destination pointer. The end condition will be a sign check, so it must be less than 8000h-textlength, and it must be larger than 100h+COMlength. We will use STOSB and MOVSB, so DI is the right register (CS=ES=DS), and BP is just fine for initialization.

         mov     di,bp           ; di = free space below 0x8000

e. DI is needed as source for the second pass.

         push    di

f. We can start the decoding pass. In the first versions I made a GetBits routine as in the example program above. Later, I had a better idea. I read 11 bits at once and after decoding we rewind as needed.

 decode: mov     cl,11           ; get 11 bits
 bitloop:
         bt      [bx],bp
         inc     bp
         rcr     ax,1            ; set msb in ax
         loop    bitloop
         shr     ax,6            ; first bit to cflag, others stay in ax

g. Now we have 10 bits in AX, and the first bit in cflag. This is very useful because we can easily decide if we have a repetition pattern or a 5bit character.

         jc      move            ; move token ?

h. 5bit->7bit converting is simple now, as we have the lookup table at [BX]. But, first we must rewind our bit pointer by 5bits because we used only 1+5 bits out of 11.

         sub     bp,5            ; go back 5bits
         and     al,31           ; only 5bit used (11-1-5=5)
         sub     al,25           ; special char?
         js      abc
         xlat                    ; lookup special char from start[0..6]
 abc:    add     al,90           ; add 'Z'
         stosb
         jmp     ok

i. In case of repetition: the 10 bits in AX are for the offset. The LSB decides the type and whether only 6 bits are used and whether rewind is necessary. I use CX=3 as the minimal repetition size, and decrement in case of small type repetition.

 move:   mov     cl,3            ; mincount = 3
         shr     ax,1            ; check next bit
         jc      size9           ; type0 or type1 token?
         sub     bp,cx           ; go back 3bits
         and     ax,63           ; only 6bit used (11-1-1-3=6)
         dec     cx              ; mincount = 2
 size9:

j. We have the offset in AX, now let's copy those bytes. There is one trick: after I copy the minimal size, I reuse the MOVSB part of the REP MOVSB to copy the rest.

         mov     si,di           ; source = (current pos)-(data from ax)
         sub     si,ax
 moving: rep movsb               ; first move mincount
         bt      [bx],bp
         inc     bp
         jc      moving+1        ; move one bytes until 0 bit found

k. The end condition of the first pass.

 ok:     or      di,di
         jns     decode          ; loop until di >= 0x8000

                                 ; PASS 2
         pop     si              ; si=stage1,di>=0x8000,bx=0x100,dx=0
         push    di
         cwd

j. Here si = pass1 7bit output, di can be used as pass2 output, and BX=100, CX=0, DX=0. The second part is much more compilcated. As you could see at the lookup table, we coded the '.' and '?' as '.'-32 and '?'-32 to distinguish them from the other special characters. Now we can specify 4 input ascii code ranges:

                 Range1   0..31      for '.' and '?'
                 Range2   32..63     for other non alphabetical characters
                 Range3   64..95     for 'A'..'Y'
                 Range4   128..255   for newline (signed!)

We have two converting modes:

             Normal - add 32 to Range1 and Range3 characters (or bitwise OR 32)
             Upcase - leave everything unchanged

Mode change conditions:

             Normal->Upcase - at startup and after Range1 characters
             Upcase->Normal - only after Range3 characters

From the range limits you can see the importance of the 5th bit. By converting we can bitwise OR the mode bit to the 5th bit, if we assume bit-value 1 for Normal mode and bit-value 0 for Upcase mode. For mode changes we use ADD for mode bit and the original 5th bit and use carry as the next mode bit.

However, newline chars must be detected before all this converting and must break the line loop which should have the default size of 44 chars. This bit mode is a little confusing. We can use a normal register like DL to this purpose, and use only the 5th bit. For normal mode DL=20h and for Upcase mode 0h, which is the initial mode.

 postadj:mov     cl,44           ; assume line will be 44 chars
 line:   lodsb
         or      al,dl           ; case dl=0  nothing done, upper case
                                 ; case dl=32
                                 ;   case al=A..Z -> a..z
                                 ;   case al=, $ - space -> nothing done
                                 ;   case al=.-32 ?-32 -> normal . ?
         js      newline         ; if special negative byte, break loop
         stosb                   ; store adjusted char

         add     dl,[si-1]       ; calc new dl
         shr     dl,1            ; when al=A..Z -> dl=32
         and     dl,20h          ; when al=, $ - space -> dl=dl
                                 ; when al=.-32 ?-32 -> dl=0
         loop    line
 newline:mov     ax,0A0Dh        ; store newline chars
         stosw

k. Wow, we processed one line, wrote the normal newline charaters and everything is ok, but wait, we don't have any end conditions as usual. We can use BX as a line counter, we have plenty of space for those max. 44 long lines, just decoding 256 lines should be enough.

         dec     bx              ; line counter
         jne     postadj

l. At last, we can display our text.

         pop     dx              ; print text
         mov     ah,9
         int     21h
         ret

m. Just insert the compressed data here.

 data:                           ; label for initializing BP